2/4/2018

Crime Data Dive

Nathan Day

natedayta.com

Why drug crime?

  • Cooler than parking tickets
  • Access to ~31,000 observations, over 5 years!
  • Have you seen The Wire?

Goals of this talk

  • Show distribution of crime in our city
  • Provide a template for analyzing more data
  • Get people excited about using R

Data Process

  1. Open Data Portal - Crime Reports
  2. Geocode via Google Maps API
  3. Use library(sf) %>% tidyverse for spatial data exploration
  4. Model and test patterns with library(spdep) and glm()
library(geojsonio) # get ODP data
library(sf) # the new spatial kid
library(spdep) # the spatial granddaddy
library(broom) # extract model info easy
library(magrittr) # %<>% life
library(tidyverse) # duh

Where is crime happening?

Most frequent addresses

Top 3 addresses for drug and non-drug crime

arrange(crime_counts, -n) %>% group_by(drug_flag) %>% slice(1:3)
## # A tibble: 6 x 3
## # Groups:   drug_flag [2]
##   address                             drug_flag     n
##   <chr>                               <chr>     <int>
## 1 600 E MARKET ST Charlottesville VA  drugs       410
## 2 400 GARRETT ST Charlottesville VA   drugs        38
## 3 700 PROSPECT AVE Charlottesville VA drugs        38
## 4 600 E MARKET ST Charlottesville VA  not_drugs   635
## 5 700 PROSPECT AVE Charlottesville VA not_drugs   412
## 6 1100 5TH ST SW Charlottesville VA   not_drugs   341

The police station's address is 606 E Market Street…

What is going on at the police station?

"The answer is quite simple - when individuals walk in to the police department to file a report the physical address of the department (606 E Market Street) is often used in that initial report if no other known address is available at the time. This is especially true for incidents of found or lost property near the downtown mall where there is no true known incident location. The same is true for any warrant services that result in a police report occurring at the police department." - CPD

Test if proportions are different

station_props <- arrange(crime_counts, -n) %>%
    group_by(drug_flag) %>%
    add_count(wt = n, name = "nn") %>%
    slice(1)

with(station_props, prop.test(n, nn)) %>% tidy
## # A tibble: 1 x 9
##   estimate1 estimate2 statistic p.value parameter conf.low conf.high method
##       <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr> 
## 1     0.222    0.0218     2135.       0         1    0.181     0.220 2-sam…
## # … with 1 more variable: alternative <chr>

Yes, they are: the station address accounts for about 22% of drug reports but only about 2% of non-drug reports.
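For anyone new to prop.test(), here is a minimal sketch of the same two-sample comparison on made-up counts. The numbers below are hypothetical and chosen only for illustration; they are not the Charlottesville totals.

```r
# Hypothetical counts, for illustration only (not the talk's data)
at_station  <- c(drugs = 40, not_drugs = 60)     # reports filed at one address
all_reports <- c(drugs = 200, not_drugs = 3000)  # total reports per category

# Two-sample test of equal proportions (base R, stats package)
res <- prop.test(x = at_station, n = all_reports)

res$estimate # share of each category filed at that address
res$p.value  # a small value suggests the two shares differ
```

The `estimate1` and `estimate2` columns in the tidy() output correspond to `res$estimate` here, in the order the groups were passed in.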

Aggregate into areas

Census blocks make a lot of sense because:

  • Tons of data in Census and American Community Surveys
  • Reputable source with code books and APIs
  • Easy to access in R via ODP and library(tidycensus)

Start to do it

long_url <- "https://opendata.arcgis.com/datasets/e60c072dbb734454a849d21d3814cc5a_14.geojson"
census <- geojsonio::geojson_read(long_url, what = "sp") %>%
    st_as_sf()

ggplot(census, aes(fill = HU_Vacant / Housing_Units)) + # fill with whatever you want
  geom_sf() +
  scale_fill_viridis_c() # aww yess

Keep doing it

  • Start with geocoded version
crime <- read_csv("https://github.com/NathanCDay/cville_crime/raw/master/crime_geocode.csv")
crime %<>% filter(complete.cases(.))
crime %<>% filter(address != "600 E MARKET ST Charlottesville VA")
  • Convert to sf, with same Coordinate Reference System (critical)
crime %<>% st_as_sf(coords = c("lon", "lat"), crs = st_crs(census))
  • Use sf::st_within() and friends
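crime %<>% mutate(within = st_within(crime, census) %>% # row index of the containing block
                      sapply(function(i) if (length(i)) i[1] else NA_integer_)) %>% # NA when a point is in no block
    filter(!is.na(within))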

There are a bunch of other great st_x(sf_a, sf_b) functions too. If you want to do it, there's a tool for it.
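A small sketch of two of those "friends" on toy geometry, assuming nothing beyond library(sf). The squares, points, and the `unit_square()` helper are all made up for illustration; st_join() with `join = st_within` is one alternative way to tag points with their containing polygon, not necessarily how this talk's data was built.

```r
library(sf)

# Hypothetical helper: a unit square with lower-left corner at (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1),
                        c(x0, 1), c(x0, 0))))
}

# Two made-up "blocks" side by side, and three made-up report locations
blocks <- st_sf(block_id = c("A", "B"),
                geometry = st_sfc(unit_square(0), unit_square(1)))
reports <- st_sf(report = 1:3,
                 geometry = st_sfc(st_point(c(0.5, 0.5)),
                                   st_point(c(1.5, 0.5)),
                                   st_point(c(5, 5)))) # outside both blocks

# Attach the containing block to each point; unmatched points get NA
joined <- st_join(reports, blocks, join = st_within)
joined$block_id

# Count points per block using a different predicate
reports_per_block <- lengths(st_intersects(blocks, reports))
```

Both layers here have no CRS set, which is fine for a toy; with real data the two layers need the same CRS, as noted above.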

Done with it

  • Flag by interest
crime %<>% mutate(drug_flag = ifelse(grepl("drug", Offense, ignore.case = TRUE),
                                     "drugs", "not_drugs"))
  • Summarise with tidyverse
crime_block <- st_set_geometry(crime, NULL) %>% # remove geometry for spread() to work
    group_by(within, drug_flag) %>%
    count() %>%
    spread(drug_flag, n) %>%
    mutate(frac_drugs = drugs / sum(drugs + not_drugs)) %>%
    ungroup() # geom_sf doesn't care for grouped dfs/tbls
  • Join in
census %<>% inner_join(crime_block, by = c("OBJECTID" = "within"))
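The summarise step can be tried on a toy table first. This sketch swaps in pivot_wider(), the newer tidyr replacement for spread(), and fills missing combinations with 0 so that blocks with no drug reports get a fraction of 0 rather than NA. The `toy` data and column names are made up.

```r
library(dplyr)
library(tidyr)

# Made-up reports, one row per report (illustration only)
toy <- tibble(
  block     = c(1, 1, 1, 2, 2, 3),
  drug_flag = c("drugs", "not_drugs", "not_drugs", "drugs", "drugs", "not_drugs")
)

toy_block <- toy %>%
  count(block, drug_flag) %>%                    # reports per block and flag
  pivot_wider(names_from = drug_flag, values_from = n,
              values_fill = list(n = 0L)) %>%    # absent combinations become 0
  mutate(frac_drugs = drugs / (drugs + not_drugs))

toy_block
```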

Hot blocks

ggplot(census, aes(fill = frac_drugs)) +
    geom_sf() + scale_fill_viridis_c()

Is it random?

Test with Moran's I statistic

## # A tibble: 1 x 5
##   statistic p.value parameter method                            alternative
##       <dbl>   <dbl>     <dbl> <chr>                             <chr>      
## 1     0.213   0.009       991 Monte-Carlo simulation of Moran I greater
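The code behind this test is not shown on the slide. As a rough sketch of how a Monte Carlo Moran's I test can be run with library(spdep), here it is on a made-up 6 x 6 grid rather than the census blocks. For real polygons the neighbour list would come from something like poly2nb() instead of cell2nb(); the grid, the toy variable, and `nsim = 999` are all assumptions.

```r
library(spdep)

set.seed(42)

# Neighbour list for a made-up 6 x 6 grid (rook contiguity)
nb <- cell2nb(6, 6)
lw <- nb2listw(nb, style = "W") # row-standardised spatial weights

# Toy variable with an obvious spatial trend plus a little noise
x <- rep(1:6, each = 6) + rnorm(36, sd = 0.3)

# Permutation test: shuffle x across cells and recompute Moran's I each time
mc <- moran.mc(x, lw, nsim = 999, alternative = "greater")

mc$statistic # observed Moran's I
mc$p.value   # pseudo p-value from the rank of the observed statistic
```

A clearly positive statistic with a small p-value, as in the output above, suggests neighbouring areas have more similar values than random shuffling would produce.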

Are there other community metrics that are correlated?

Get more data

Median income data comes from the American Community Survey via library(tidycensus) to supplement housing and demographics from the original Census data from ODP.

Which predictors are significant?

  • Use a glm() to fit the highly correlated predictors simultaneously.
mod <- glm(frac_drugs ~ frac_black + income,
           data = census, family = quasibinomial())
  • Identify the relationships that matter
##                  Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept) -3.455431e+00 2.272112e-01 -15.2080119 9.842326e-17
## frac_black   1.570012e+00 3.546781e-01   4.4265838 9.388946e-05
## income      -5.241156e-07 2.786602e-06  -0.1880842 8.519288e-01

The proportion of the population that is black is significant, but median income is not.
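To read those estimates on a more familiar scale, the coefficients can be pushed back through the inverse logit (the default link for quasibinomial()). This is a back-of-envelope sketch using the numbers printed above; the income of 50000 and the two frac_black values are hypothetical inputs chosen only for illustration.

```r
# Coefficients copied from the model summary above
b0       <- -3.455431      # intercept
b_black  <-  1.570012      # frac_black
b_income <- -5.241156e-07  # income

# Drug fraction implied at frac_black = 0 and income = 0 (inverse logit)
baseline <- plogis(b0)

# Multiplier on the odds when frac_black goes from 0 to 1, income held fixed
odds_ratio <- exp(b_black)

# Predicted drug fractions at two frac_black values, hypothetical income of 50000
income <- 50000
pred <- plogis(b0 + b_black * c(0.1, 0.6) + b_income * income)

round(c(baseline = baseline, odds_ratio = odds_ratio), 3)
round(pred, 3)
```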

  • A good model should have randomly dispersed residuals.
## # A tibble: 1 x 5
##   statistic p.value parameter method                            alternative
##       <dbl>   <dbl>     <dbl> <chr>                             <chr>      
## 1   -0.0691    0.65       350 Monte-Carlo simulation of Moran I greater

Wrap up

Does drug enforcement target black communities?

More steps:

  • Get data about police patrol locations/frequency

  • Dig deeper on the crime reporting procedure

  • How many of these "drug" crimes are low-level offenses?

  • Add temporal elements to the model, e.g. seasonal, time of day

Questions?

Thanks for listening